Apparatus for classifying or disambiguating data

ABSTRACT

A computing system has a data storage device (4, 5, 6) for storing a database consisting of a classified vocabulary of terms. A processor (1) of the apparatus is arranged to associate each term with one of a number of different categories of data and to associate all terms falling within the same category with a common code identifying a collocation of terms that exemplify that category so that terms in different categories are associated with different codes and can be disambiguated. The processor (1) is arranged to write, directly or indirectly, a classified vocabulary consisting of the terms together with the associated code onto a computer-readable storage medium (RDD2) or to supply an electrical signal via, for example, a MODEM (10) or a LAN/WAN (11). The database may be used in classification of documents, spelling checking of documents and refining of keyword search results.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a Continuation of and is based upon and claims the benefit of priority under 35 U.S.C. §120 for U.S. Ser. No. 11/949,571, filed Dec. 3, 2007 which is a Continuation of U.S. Ser. No. 10/990,534, filed Nov. 18, 2004, which is a Continuation of U.S. Ser. No. 09/412,754, filed Oct. 5, 1999 and claims the benefit of priority under 35 U.S.C. §119 from British Patent Application No. 9821787.0, filed Oct. 6, 1998, the entire contents of each of which are incorporated herein by reference.

This invention relates to apparatus for classifying or processing data. In particular this invention is concerned with apparatus for enabling use, storage, disambiguation or manipulation of an item of data in accordance with a category, for example a subject matter area, within which that item of data is determined to fall.

Classification schemes are used to enable items of data in a particular category to be retrieved either from a physical location or electronically. Various different specific classification schemes exist. Thus, for example, the Dewey Decimal, Universal Decimal and Library of Congress classification schemes are all used to classify library material to enable librarians and other people using a library to identify the location of books and other publications by title, by author or by subject matter. In addition, international standard industry codes exist to classify commercial products and the Whittacker system classifies living organisms. Each of these existing classification schemes is thus particular to a certain type of subject matter and, moreover, requires that each individual item of data such as a book or publication be manually classified to enable its subsequent retrieval using the classification scheme.

Since such manual classification is a time-consuming and costly activity, several attempts have been made to devise a means of automatically classifying documents, primarily by comparing words in the document with words known to occur frequently in particular subject areas. Such an approach is described in WO97/10557. Where the words in the document include sufficient of the frequently-occurring subject words, the document is determined to be about that subject. A drawback to this approach is that when a large number of subject areas are involved, the speed of comparison may be slow. It is also the case that, since this approach is based on word frequencies, a document which contains unusual words may be classified incorrectly.

The Internet provides, via the world wide web, access to a large amount of data. A number of search engines are available via the world wide web to enable retrieval of documents containing text on a specific topic. To retrieve documents relating to a specific topic, a keyword (which may consist of one or more terms) is entered and the search engine then searches for documents available electronically via the world wide web and containing that keyword. The results of the search are then collated and the titles displayed to the user who can then access the individual documents. However, such search engines are extremely inefficient, frequently returning very large numbers of ‘hits’ or documents which are not directly related to the search because, in many cases, it is not possible to identify precisely the field of enquiry simply by means of a keyword. For example, if the keyword is ‘depression’, documents relating to each of the meteorological, economic and medical meanings of the term ‘depression’ will be retrieved. Some search engines seek to improve results by offering additional keywords for selection by the user in order to expand the search term. Such keywords are generally based on frequency counts and may therefore exclude the required subject area if this is less common.

It is an aim of the present invention to provide an apparatus for classifying terms in a manner which can be universal and which enables more efficient and accurate identification and extraction of terms relating to a specific desired topic or subject matter area, so enabling disambiguation.

In one aspect, the present invention provides apparatus for storing data on a computer readable storage medium having means for associating all terms falling within a common category with a common code identifying a collocation associated with that category and means for directly or indirectly writing each term together with the associated code onto a computer readable storage medium. The writing means may be arranged also to write the collocation for the associated code onto the computer readable storage medium. The writing means may be replaced or supplemented by signal generating means for generating a signal carrying each term together with the associated code and optionally also the associated collocation.

The categories may comprise different subject matter areas which are desirably sufficient to encompass all data currently available in the world. Typically, the subject matter areas may be the universe, the earth, the environment, natural history, humanity, recreation, society, the mind, human history and human geography. Each of these subject matter areas may be divided into smaller subject matter areas which may themselves in turn be divided into even smaller subject matter areas. Desirably, each category comprises a combination of a subject matter area and a species or genus with each item of data being allocated to only one species or genus. Typically, there may be five species or genera which may consist of, for example, people, places, organisations, products and terminology with the latter genus including general concepts within the subject matter area. The classification of terms into both subject matter areas and genera enables efficient and accurate retrieval of terms in a context specific manner and enables a distinction to be made between the use of the same term as the name of a person, the name of a place and the name of an organisation, for example.

In one aspect, the present invention provides apparatus for storing data on a computer readable storage medium, comprising:

means for storing terms;

means for associating each term with one of a number of different subject matter areas;

means for associating each term with one of a number of different species areas such that each item of data is associated with one or more subject matter areas but only with one species area; and

means for directly or indirectly writing each term onto a computer readable storage medium in association with a code or codes identifying the associated subject matter and species areas.

The writing means may be replaced or supplemented by means for generating a signal carrying the same data as is written onto the computer readable storage medium.

In one aspect, the present invention provides apparatus for processing data by determining which of a number of collocations each associated with a specific different category is relevant to a received term.

In one aspect, the present invention provides apparatus for checking the spelling of terms in a text which comprises means for determining a category relevant to the text and means for highlighting or otherwise identifying to a user terms which may have been incorrectly used. Such apparatus may desirably comprise: means storing a vocabulary and means for comparing the terms used in the text with the terms in the vocabulary to identify any terms in the text not present in the vocabulary; means for determining, when unknown terms are identified in the text, likely possible alternative terms in the vocabulary that have the same category and means for advising a user of the possible alternative term or terms. Such apparatus may be used as part of a word processing arrangement to check the spelling of terms or words in a text document. Such apparatus may also be used to check, where the spelling is correct, that none of the terms used in the text being checked are inappropriate for the determined category of the document.

In one aspect, the present invention provides apparatus for classifying a text which comprises means for comparing terms used in the text with the terms used in a classified vocabulary in which classified terms are associated with categories and means for allocating a classification code to the text in accordance with the results of the comparison. The text to be classified may be supplied in a computer readable form or may be optically scanned and then converted into a computer readable form using known optical character recognition software. Such apparatus enables text to be classified automatically without the need for a person skilled in the subject matter area of the text or in document classification to study the text to determine the subject matter area to which the text relates.

In one aspect, the present invention provides apparatus for refining the results of a subject matter search carried out by a search engine using a keyword, for example an Internet search engine, the apparatus comprising:

means for accessing a plurality of collocations, each collocation being associated with a respective different one of a number of categories;

means for determining whether the keyword falls in one or more of the different categories and, if the keyword used falls within a number of different categories, advising a user of these different categories;

user operable selection means for selecting one of the determined categories;

means for comparing the terms used in each text located by the search with the terms in the collocation associated with the selected category; and

means for filtering the search results in accordance with the number of terms the search result texts have in common with the collocation associated with the selected category.
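By way of illustration only, the comparing and filtering means listed above might be sketched as follows. The function names, the word-splitting rule, the `min_shared` threshold and the three sample search results are assumptions made for this sketch and do not come from the specification; the collocation terms are those given later in the description for the economic sense of ‘depression’. Only single-word collocation terms are handled.

```python
import re

def shared_term_count(text, collocation):
    """Number of distinct collocation terms that occur as words in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(words & {term.lower() for term in collocation})

def filter_results(results, collocation, min_shared=2):
    """Return titles of results sharing at least min_shared terms, best match first."""
    scored = [(shared_term_count(text, collocation), title) for title, text in results]
    kept = [pair for pair in scored if pair[0] >= min_shared]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in kept]

# Collocation terms quoted in the description for 'depression' (economic sense).
ECONOMIC_COLLOCATION = ["economy", "employment", "low", "poor", "poverty",
                        "market", "social", "failure", "money", "jobs"]

# Hypothetical search results for the keyword 'depression' (title, text).
results = [
    ("Market slump", "The economy entered a depression as jobs vanished and the market fell."),
    ("Storm warning", "A deep depression will bring low cloud and rain."),
    ("Coping with depression", "Treatment for depression includes therapy."),
]

print(filter_results(results, ECONOMIC_COLLOCATION))
```

In this sketch a result is retained only if it shares at least two terms with the collocation of the category selected by the user; the threshold is an arbitrary choice and a ranking without a cut-off would serve equally well.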

The present invention also provides a computer usable storage medium carrying processor implementable instructions for causing operation of apparatus according to any of the aspects referred to above.

The present invention also provides a computer readable storage medium or signal carrying the results of operation of apparatus in accordance with any one of the aspects referred to above.

Embodiments of the present invention will now be described, by way of example, with reference to the accompanying drawings, in which:

FIG. 1 shows a block diagram for illustrating the architecture of a computer apparatus for use in the present invention;

FIG. 2 shows diagrammatically how terms are divided into subject matter areas or domains;

FIG. 3A shows the structure of an item of data in a classified vocabulary;

FIG. 3B shows the structure of an item from a classification scheme data set;

FIG. 4 shows a flowchart for illustrating use of apparatus embodying the invention for classifying a text or document;

FIGS. 5 to 9 show diagrammatically the image displayed on a display of the apparatus shown in FIG. 1 at various stages in a method embodying the invention for refining the results of a search;

FIG. 10 shows a flowchart for illustrating a method embodying the invention of refining the results of a search;

FIG. 11 shows a flowchart for illustrating use of apparatus embodying the invention for checking the spelling of terms in a document; and

FIG. 12 shows a flowchart for illustrating use of apparatus embodying the invention for checking for usage of terms in a document.

For ease of understanding, definitions of several of the terms or phrases used herein will now be given.

As used herein the phrase “item of data” means an entry in the classified vocabulary that includes a term, its description and at least one of a corresponding category identification and a classification code.

As used herein the word “term” means a term which may consist of one or more words (including made up words, proper nouns, etc.) or abbreviations and which may have one or more different meanings but which, for a given meaning, conveys a single concept. It will be understood that a single term may have more than one meaning. Thus, for example, the term “depression” has a number of meanings including a meteorological, a medical and an economic meaning.

As used herein “classification scheme” means the set of subject matter areas or domains and associated genera used to classify terms.

As used herein “category” means a specific combination of the subject area and genus in which a term is classified.

As used herein “classification code” means the code allocated to a term and which identifies the category within which the term falls.

As used herein “category identification” means a code unique to a classification code and a particular collocation.

As used herein “classified vocabulary” means a set of terms classified in accordance with the classification scheme.

As used herein “classification data set” means a set of items each consisting of a collocation, a characterisation or description of that collocation and at least one of the corresponding category identification and classification code.

As used herein “collocation” means a collection of terms (not necessarily organised in any specific order) that exemplify a category of data and which would frequently be found in documents that should fall within that category.

As used herein “keyword” means a search term (which may be made up of one or more words and/or abbreviations) entered by a user.

FIG. 1 shows a computing system which is constructed of conventional components. In this example, the computing system comprises a conventional personal, for example desktop, computer and associated peripherals. The computing system could, however, also be a mobile computing system such as a lap-top with appropriate peripherals or an in-car system or a larger system such as a minicomputer or mainframe depending upon the user's requirements. FIG. 1 shows a functional block diagram of the main elements of the computing system necessary for understanding the present invention. It will, of course, be appreciated that the computing system will have all the necessary interfaces, buses etc. for enabling correct operation of the computing system.

As shown in FIG. 1, the computing system has a processor 1 for carrying out processor implementable instructions, a random access memory (RAM) 2 for storing data and other instructions used by the processor 1, a read-only memory (ROM) 3, a hard disk drive (HD) 4 also for storing instructions and data usable by the processor 1 and, in this example, two storage devices (RD1 and RD2) 5 and 6 having removable data storage media or disks (RDD1 and RDD2) which are shown partly inserted into their respective drives in FIG. 1. As an example, one of the data storage devices 5 and 6 may be a read-only device such as a CD ROM drive with the removable data storage disk RDD1 providing data and/or processor implementable instructions to be read by the processor 1 while the other data storage device may be capable of both reading from and writing to the removable disk RDD2 and may be, for example, a floppy disk drive, a writable or many times writable CD or other optical or magneto-optical disk drive or a ZIP (Trade Mark) or SPARQ (Trade Mark) magnetic storage type device.

As shown in FIG. 1, the computing system also has a display 7 such as a cathode ray tube or liquid crystal display, a user input device or devices 8 which may comprise both a pointing device such as a mouse and a keyboard, a printer 9, a MODEM 10 for enabling connection to, for example, the Internet and possibly also a local area or wide area network (LAN/WAN) connection 11 for coupling the computing system in a network with other similar computing systems. The computing system may also have a scanner 12 which, together with conventional optical character recognition software stored in, for example, the hard disk drive 4, enables the computing system to convert paper text documents into electronic text documents. The user input device(s) 8 may also include a microphone and the computing system may have speech recognition software for enabling vocal input of data or instructions.

FIG. 2 illustrates functionally the overall structure of a database which is accessible by the processor 1 of the computing system from one of the local data storage devices (such as the hard disk drive 4 or one of the two removable disk drives 5 and 6) or remotely via the MODEM 10 or the LAN/WAN connection 11. The database consists of: 1) a classification scheme and accompanying classification scheme data set; and 2) a classified vocabulary consisting of classified terms. Block 20 in FIG. 2 illustrates schematically the classification scheme. The classified terms may relate to any information known in the world and the classified vocabulary can cover all of the subject matter categories of the database shown in FIG. 2. As illustrated in FIG. 2, the classification scheme classifies terms into ten major subject matter areas or domains 21 with, in this example, the major domains being: the Universe (UN), the earth (EA), the environment (EN), natural history (NH), humanity (HU), recreation (RE), society (SO), the mind (MI), human history (HH) and human geography (HG).

In the classification scheme, each of these major subject matter areas is divided into subsidiary subject matter areas or subsidiary domains. FIG. 2 illustrates this schematically only for the major subject matter area UN (the Universe) and partly for the major subject matter area EA (the Earth). As shown in FIG. 2, the subject matter area UN is divided into four subsidiary subject matter areas: space exploration (SPA), cosmology (COS), time (TIM), and aliens and other signs of extraterrestrial life (ALI). Although not shown in this example, each of these subsidiary subject matter areas or domains may be itself divided into a number of subsidiary subject matter areas or domains which may in turn be divided into further smaller subject matter areas or domains and so on. It will, of course, be appreciated that there are areas of overlap between the identified subject matter areas and that some terms may be classified in more than one subsidiary subject matter area or domain or even in more than one major subject matter area or domain.

Each (major or subsidiary) subject area or domain has five species areas or genera 23 which are, in this example, people, locations, products, organisations and terminology. The genus ‘product’ includes the names by which anything may be sold which will include, in addition to trade names and trade marks, song and book titles, for example. The genus ‘terminology’ includes general concepts in the related subject matter area or domain. Any one item of data can belong only to one genus although it may belong to more than one (major or subsidiary) subject matter area or domain. Thus, each meaning of a term in the classified vocabulary will be allocated to a specific category in the classification scheme with the specific category being defined by its allocated major and subsidiary subject matter areas or domains and its allocated genus. This facilitates differentiation between use of the same word as a common noun, a person's name and the name of an organisation because the database treats the three different meanings of the same word as being different terms because they are allocated to different ones of the five genera.

To facilitate understanding of the database structure, specific examples will be given below.

Thus, a term which relates to space exploration will be classified in the subsidiary subject matter area or domain (SPA) within the major subject matter area or domain (UN). Each classified term within the subsidiary subject matter area (SPA) will then be allocated to one of the five genera. Thus, for example, terms consisting of the names of astronauts, cosmonauts and mission control personnel will be allocated to the genus ‘people’ and so to a category defined by the combination of the subject matter and the genus with, in this example, a classification code: UN SPA SPAP, where the final four-letter part indicates the genus, that is people (P), in the subsidiary domain SPA. In contrast, terms consisting of the names of space exploration organisations will be allocated to the genus ‘organisations’ and will have a category or classification code: UN SPA SPAORG where the last three letters of the final part of the code indicate that the genus is the organisation genus.

To take another example, one of the subsidiary subject matter areas of the major subject matter area or domain ‘the earth’ is climate (CLI) and the field of meteorology is classified at: EA CLI. Terms consisting of the names of meteorologists are classified in category: earth-climate-people (classification code EA CLI CLIP) while the term “the UK meteorological office” is classified in the category: earth-climate-organisations (classification code EA CLI CLIORG). The term “UK meteorological office” may also be classified in: human geography; Europe; UK; organisations (classification code HG EU UKIORG) to enable it to be identified as a UK organisation independently of its existence within the field of meteorology.
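The pattern followed by the space exploration and climate codes given above can be sketched, purely as an illustration, as the path of domain codes followed by the last domain code joined to a genus suffix. The function name and the suffix table are assumptions of this sketch; the suffixes P and ORG are taken from the examples above and GEN from the economics example given later in the description, while suffixes for the remaining genera are not stated in the text and are therefore omitted. The human geography code quoted above is not covered by this simple pattern.

```python
# Genus suffixes that actually appear in the description's example codes.
GENUS_SUFFIX = {"people": "P", "organisations": "ORG", "terminology": "GEN"}

def classification_code(domain_path, genus):
    """Join the domain codes and append the last domain code plus the genus suffix."""
    if genus not in GENUS_SUFFIX:
        raise ValueError(f"no suffix is given in the text for genus {genus!r}")
    final_part = domain_path[-1] + GENUS_SUFFIX[genus]
    return " ".join([*domain_path, final_part])

print(classification_code(["UN", "SPA"], "people"))         # UN SPA SPAP
print(classification_code(["EA", "CLI"], "organisations"))  # EA CLI CLIORG
```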

It will, of course, be appreciated that the above subsidiary subject matter areas are examples only and that the person skilled in the art may adopt or add different subject matter divisions. Generally, however, the ten major subject matter areas or domains will be those given above. Similarly, the five particular genera selected are exemplary and it is possible that alternative genera may be used. What is, however, important is that all terms are classified in accordance with the classification scheme with each classified term being allocated to one or more specific subject matter areas (which may be a subsidiary subject matter area within a major or other subsidiary subject matter area) but only to one specific one of the available genera so as to enable disambiguation between different meanings of the same word, phrase or abbreviation.

As illustrated schematically by FIG. 3A, each entry 30 in the classified vocabulary consists of the classified term 31, a description 32 which comprises a word or phrase describing the general nature or subject matter area domain of the term, a definition 33 and, in this example, a category ID (CAT ID) which identifies the category to which the term is allocated. Because the category ID is unique to the classification code, the classification code may be used in place of the category ID in FIG. 3A.

Each entry in the classified vocabulary may also include a field 35 for containing part of speech (for example noun, verb, adjective, adverb) information to assist in document classification and fields 36 and 37 for containing inflected forms and abbreviations and derivatives so that the classified vocabulary need contain only an entry for the root term and does not require separate entries for inflections, derivatives and abbreviations.
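As a minimal sketch of the entry structure just described, the fields of FIG. 3A might be held in a record such as the following. The class name, attribute names and the `matches` helper are assumptions of this sketch, and the inflected form ‘depressions’ is supplied only for illustration; the remaining field values are those of the economics example in the description, with the classification code used in place of the category ID as the text permits.

```python
from dataclasses import dataclass, field

@dataclass
class VocabularyEntry:
    term: str                      # classified term (31)
    description: str               # general nature or subject matter area (32)
    definition: str                # definition (33)
    category: str                  # category ID or, equivalently, classification code
    part_of_speech: str = ""       # part of speech information (35)
    inflected_forms: list = field(default_factory=list)                # field 36
    abbreviations_and_derivatives: list = field(default_factory=list)  # field 37

    def matches(self, word):
        """True if word is the root term or one of its listed variant forms."""
        forms = [self.term, *self.inflected_forms, *self.abbreviations_and_derivatives]
        return word.lower() in {form.lower() for form in forms}

entry = VocabularyEntry(
    term="depression",
    description="Economics",
    definition="A period of low business and industrial activity "
               "accompanied by a rise in unemployment.",
    category="SO ECO ECOGEN",
    part_of_speech="noun",
    inflected_forms=["depressions"],  # illustrative value, not from the text
)

print(entry.matches("Depressions"))
```

Holding the variant forms on the root entry in this way reflects the stated aim that the vocabulary need contain only one entry per root term.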

Two examples of vocabulary entries are shown below. These entries omit, in the interests of clarity, inflections and abbreviations or derivatives.

EXAMPLE 1

Term: Depression. Description: Economics. Definition: A period of low business and industrial activity accompanied by a rise in unemployment. Classification Code: SO ECO ECOGEN (society-economics-economic terminology). Part of speech: Noun.

EXAMPLE 2

Term: Tony Blair. Description: Politician. Definition: UK Politician, leader of the Labour Party and Prime Minister from 1 May 1997. Classification Code: SO POL POLP (society-politics-person). Part of speech: Proper noun.

Each different category (that is each specific combination of subject matter subsidiary domain and genus) is associated with a unique classification scheme data set item CL in the classification scheme data set. FIG. 3B illustrates the basic structure of an item CL in the classification scheme data set.

Each classification scheme data set item CL includes the corresponding classification code and collocation for the category and a characterisation which gives a brief description of the category.

As noted above, the collocation consists of terms that exemplify the category and which would frequently be found in documents that should fall within the category. For example, a collocation will include terms which may be used to describe the function, appearance or relationship with other objects of the classified terms in the associated category or any other terms (for example ‘buy’, ‘sell’ in relation to cars) which may generally be used in the same context as the classified terms. For example, where the item of data is the term ‘depression’ in the economic sense, then terms which may be included in the corresponding collocation include: economy, employment, low, poor, poverty, market, social, failure, money, jobs etc.
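To illustrate how a collocation can be used to decide which category is relevant to an ambiguous term, the following sketch holds each classification scheme data set item as a record and selects the item whose collocation shares most terms with the words surrounding the term. The class and function names are assumptions of this sketch. The economic collocation is the one listed above; the meteorological item, including its code EA CLI CLIGEN and its collocation, is invented here for illustration and is not taken from the specification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SchemeItem:
    classification_code: str
    characterisation: str
    collocation: frozenset

ECONOMIC = SchemeItem(
    "SO ECO ECOGEN", "economic terminology",
    frozenset({"economy", "employment", "low", "poor", "poverty",
               "market", "social", "failure", "money", "jobs"}),
)
# Hypothetical item: code and collocation are assumptions for this sketch.
METEOROLOGICAL = SchemeItem(
    "EA CLI CLIGEN", "climate terminology",
    frozenset({"weather", "rain", "pressure", "wind", "cloud"}),
)

def most_relevant(items, context_words):
    """Return the item whose collocation overlaps most with the context words."""
    context = {word.lower() for word in context_words}
    return max(items, key=lambda item: len(item.collocation & context))

context = "rising unemployment and poverty as jobs and money disappear".split()
print(most_relevant([ECONOMIC, METEOROLOGICAL], context).classification_code)
```

Where two items tie, this sketch simply returns the first; a practical arrangement would presumably ask the user to choose, as the search refinement aspect of the description does.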

It should, of course, be understood that the classification scheme data set items CL are in no way the same as the set of sub-headings which will generally be found in a standard library classification under each subject matter heading. Such sub-headings are analogous to the subsidiary subject matter domains mentioned above in that they define subject matter areas or specific topics which fall within the main headings. Such sub-headings do not relate to terms which may be used in discussing or describing items of data falling within the category or heading.

The collocations for the categories recognised within the classification system are determined using a mixture of encyclopaedic and lexicographical criteria. They are not just subject lexicons in the usual sense; for example, as a test case, a collocation lexicon for the category of meat within nutrition would include terms for various kinds of meat foodstuffs (lamb, pork, beef, poultry, etc) but also general words to do with the category (eat, cook, joint, fat, grilled, etc).

The collocations do not just identify domain A compared with domain B (e.g. meteorology vs literature), but levels of sub-domain within a domain (e.g. literature vs novel vs types of novel). The terms within the collocations are derived from three main sources:

1) Encyclopaedic Sources Including:

i) relevant headwords and words within entries belonging to a particular domain, as displayed in encyclopedias such as The Cambridge Encyclopedia, and associated publications; and
ii) relevant headwords taken from specialist sources outside of the above, for example place-names for a particular country from atlases, environmental terms from the indexes of various specialised works on the environment.

2) Lexicographic Sources Including:

i) relevant headwords taken from dictionaries such as the Chambers Dictionary; and
ii) relevant headwords taken from conceptually and alphabetically organised thesauri.

3) Other Sources Such as:

i) relevant words found in a set of records after searching a particular subject matter domain on the Internet;
ii) relevant words taken from a frequency listing of words in a set of Internet records; and
iii) human input from a person collating the collocations using the above information.

The terms providing a collocation may be grouped within the collocation according to their relevance to the category.

Where a classified vocabulary entry 30 gives, as shown in FIG. 3A, a category ID rather than the classification code then, as shown in FIG. 3B, each classification scheme data set item CL will include the appropriate category ID so that each classified term in the classified vocabulary is linked to a unique classification scheme data set item CL by the category ID. As noted above, this linking may be achieved by the classification codes. However, the use of a separate category ID is more efficient in computing terms.

The attached Appendix A lists examples of classified vocabulary entries and the associated classification scheme data set items.

Section 1 of Appendix A lists two entries in the classified vocabulary both relating to the word ‘bayonet’. The first example given in Appendix A is for the term ‘bayonet’ when used in the context of a light bulb fitting while the second entry is for the term ‘bayonet’ when used in the context of a camera lens fitting. As can be seen from Appendix A, these two meanings of the term ‘bayonet’ have different category IDs with the category ID for the light bulb fitting being 00010 and the category ID for the camera lens fitting being 00020 in this example.

Section 2 of Appendix A shows the classification scheme data set items identified by the category numbers 00010 and 00020. As can be seen from Appendix A, each classification scheme data set item is headed by its category ID followed by the classification code defined by the code for the main domain followed by the code for each subsidiary domain with these in turn being followed by the collocation, only a part of which is shown in Appendix A for each of the two classification scheme data set items.
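The linking of vocabulary entries to classification scheme data set items by category ID can be sketched as a keyed lookup. This is an illustrative sketch only: the category IDs 00010 and 00020 and the two senses of ‘bayonet’ come from the description, whereas the dictionary layout, the function names and the collocation terms shown are assumptions, since the contents of Appendix A are not reproduced here.

```python
# Two vocabulary entries for the same word, distinguished by category ID.
vocabulary = [
    {"term": "bayonet", "description": "light bulb fitting", "category_id": "00010"},
    {"term": "bayonet", "description": "camera lens fitting", "category_id": "00020"},
]

# Scheme items keyed by category ID; the collocation terms are invented placeholders.
scheme_items = {
    "00010": {"collocation": ["lamp", "socket", "bulb"]},
    "00020": {"collocation": ["camera", "lens", "mount"]},
}

def senses(term):
    """All vocabulary entries (meanings) recorded for a term."""
    return [entry for entry in vocabulary if entry["term"] == term.lower()]

def collocation_for(entry):
    """Follow the category ID from a vocabulary entry to its scheme item."""
    return scheme_items[entry["category_id"]]["collocation"]

for entry in senses("bayonet"):
    print(entry["category_id"], entry["description"], collocation_for(entry))
```

A short fixed-length key of this kind allows a direct lookup rather than a comparison of multi-part classification codes, which is one plausible reading of why the description calls the separate category ID more efficient.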

Terms to be classified using the apparatus shown in FIG. 1 may be supplied via one of the removable disk drives, for example on a floppy disk or CD ROM, via the scanner 12 and optical character recognition software stored on the hard disk 4 or from another similar computer via the LAN/WAN interface 11 or the MODEM 10. Alternatively or additionally, terms to be classified may be input manually by a user using the input device 8.

Individual terms may be manually classified by the user using the input device. Thus, the processor 1 will first cause the display 7 to display the table shown in FIG. 3A. Where the terms are being entered manually by the user, the user will first fill in the term in the cell 31a in FIG. 3A. If, however, the terms to be classified have been already supplied to the processor 1 and stored on the hard disk 4, then the processor 1 may be programmed to cause a first one of the terms to be displayed in the cell 31a for classification by the user and then for another term (for example the next term in an alphabetical order of the data stored on the hard disk) to be displayed once the user has classified the current term and so on. Alternatively, the processor may display all of the stored data on the display 7 and allow the user to select a term for classification by highlighting it in known manner.

Once the term to be classified has been entered into the cell 31a, the user then enters in the cell 32a a description in the form of a word or phrase describing the general nature or subject matter area of the term. For example, where the term is ‘depression’ in the economic sense as mentioned above, then the description entered by the user may be ‘economics’.

Once the user has entered the description, the processor 1 prompts the user to enter a definition of the specific term into cell 33a. Where the term is ‘depression’ then the user may enter: ‘a period of low business and industrial activity accompanied by a rise in unemployment’ or some other similar short description.

The category ID may be determined manually by the user referring to a hard copy list of the classification codes or may be determined using the computer. Thus, for example, the processor may first request the user to select one of the ten major subject matter areas or domains and then, once the major subject matter area or domain has been selected, request the user to select one of the available subsidiary domains and, once the subsidiary domain has been selected, a subsidiary domain of that domain if it exists, and so on. Once the subject area subsidiary domain has been determined, the processor may then request the user to select the required genus. Once the user has done this, then the processor 1 determines the classification code and category ID from a classification code key stored in memory (for example in the ROM 3 or on the hard disk 4). Once the category ID has been determined and entered in the cell 34a, the processor 1 may prompt the user to enter, in turn, data indicating the part of speech in cell 35a, details of inflected forms in cell 36a and details of abbreviations and derivatives in cell 37a. Where the processor 1 has access to a dictionary, for example, where an electronic dictionary is stored on the hard disk drive 4 or on a removable disk inserted into one of the drives RD1 and RD2 or an electronic dictionary is accessible via the LAN/WAN interface 11 or over the Internet then the processor 1 may be programmed to determine inflections, abbreviations and derivatives automatically from electronically available dictionary sources. Once the data has been entered in cell 37a, then the processor 1 may request the user to confirm that the entry is correct and, once this has been done, the processor will store the classified term in the classified vocabulary so that the category ID determined by the user links the classified term to the appropriate item in the classification scheme data set.
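The step-by-step selection of a major domain and then successive subsidiary domains can be sketched as a walk down a nested structure. In this sketch the nested dictionary, the function names and the `pick` callback standing in for the user's selection are all assumptions; the domain codes are those named in the description, and only the subsidiary domains that the description actually names for UN and EA are filled in, so the structure is deliberately incomplete.

```python
# Ten major domains; subsidiary domains are listed only where the text names them.
SCHEME = {
    "UN": {"SPA": {}, "COS": {}, "TIM": {}, "ALI": {}},
    "EA": {"CLI": {}},
    "EN": {}, "NH": {}, "HU": {}, "RE": {}, "SO": {}, "MI": {}, "HH": {}, "HG": {},
}

def choices(path):
    """Domain codes the user may select next, given the codes chosen so far."""
    node = SCHEME
    for code in path:
        node = node[code]
    return list(node)

def select_domain(pick):
    """Repeatedly offer the available choices until no subsidiary domain remains."""
    path = []
    while choices(path):
        path.append(pick(choices(path)))
    return path

# Stand-in for the user: always take the first option offered.
print(select_domain(lambda options: options[0]))
```

Once the domain path has been settled in this way, the genus would be requested and the classification code and category ID looked up from the stored key, as the paragraph above describes.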

Once all the desired terms have been classified, the classified vocabulary consisting of the classified terms each with their description, definition and category ID may be written onto a removable disk of the removable disk disk drive 5 or 6 or supplied as a signal, via a network or the Internet for example, to another computing system. It will be appreciated that although the classified vocabulary may change or need to be updated fairly frequently, updating or changing of the classification scheme data set may be required less frequently. Accordingly, because the classification scheme data set would generally constitute a relatively large amount of data which requires infrequent modification, the classification scheme data set may be stored separately from the classified vocabulary, for example on a separate CD ROM. It will, of course, be appreciated that the computer apparatus shown in FIG. 1 may not be the original source of the classification scheme data set subsidiary database but that this may be accessed by the processor 1 via a disk inserted into one of the two removable disk disk drives or via the LAN/WAN interface or via the MODEM 10; for example, the classification scheme data set may be accessed via the Internet from another web site.

For convenience, the classified vocabulary and classification scheme data set may both be written by the processor onto a removable disk which may be, for example, a writable CD (compact disc) or both be supplied as a signal to another computing system. Where the classified vocabulary is specific to one or more of the subject matter areas 21 shown in FIG. 2, then it would, of course, be necessary for the processor 1 to write to the removable disk or incorporate in the signal only those items of the classification scheme data set appropriate for those subject matter areas or domains.

The database described above comprising the classified vocabulary and the classification scheme consisting of the classification scheme data set has many applications. For example, once the processor 1 has access to the classified vocabulary and the classification scheme data set, text documents can be classified automatically using the apparatus shown in FIG. 1.

FIG. 4 shows a flowchart for illustrating automatic classification of a text document.

In order for the computer apparatus to classify a text document it must, of course, be in computer readable form. Where the text document is supplied as an electrical signal via the LAN/WAN 11, the MODEM 10 or via a removable disk inserted into one of the removable disk disk drives 5 and 6, this will already be the case. Where the document to be classified is not in an electronic form, then the scanner 12 and conventional optical character recognition software may be used to convert the text document into a form readable by the computer. As another possibility, the text may be entered verbally if the computing system has speech recognition software.

Whichever way the text document is provided to the computing system, it is first stored on the hard disk 4. The processor 1 then reads the document at step S1, matches the terms used in the text document being classified against the classified vocabulary at step S2, identifies (at step S3) the classification codes of the terms found in both the classified vocabulary and the text document by using the classified vocabulary and classification scheme data set (see FIGS. 3A and 3B) and assigns a weighting to each classification code. The processor 1 then determines the total weighting for each classification code at step S4 to determine the predominant classification code and then, at step S5, stores the text document again together with the predominant code so that the text document is linked with the appropriate classification scheme data set item.

Weighting of the classification codes may be carried out according to a number of different parameters and the criteria for assigning a classification code with confidence will vary from application to application. However, one way of weighting the classification codes which works well in practice is to assign each term in the text document a total weighting of one and to divide that total weighting by the number of classification codes which may relate to that term. Thus, where a term has a number of different senses (such as the term “depression”, for example), the processor 1 will identify the classification code for each sense and will assign each classification code a weighting of 1/n where n is the number of classification codes identified for the term. Another approach is for the processor 1 to assign a weighting only to terms associated with a single classification code; however, this does not give good results in practice. Another alternative approach is for the processor 1 to process the text document sentence by sentence, determine a weighted classification code for each sentence and then combine the sentence classification codes. Provided the processor 1 has access to some elementary grammatical rules (for example stored on the hard disk drive), this approach enables the processor 1 to take advantage of the part of speech information in the classified vocabulary to assist in differentiating between different senses of the same word. Generally, extremely frequent words such as “a”, “the”, “but”, “and”, “can”, “it” etc. will be ignored in step S2.
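The 1/n weighting described above may be sketched as follows. This is a minimal illustration only: the vocabulary contents, the classification code labels and the function names are hypothetical, and the simple whitespace tokenisation is an assumption rather than part of the described apparatus.

```python
from collections import defaultdict

# Hypothetical classified vocabulary: term -> classification codes, one per sense.
VOCABULARY = {
    "depression": ["ECON", "HEALTH", "WEATHER"],
    "unemployment": ["ECON"],
    "inflation": ["ECON"],
    "rainfall": ["WEATHER"],
}

# Extremely frequent words that are ignored when matching (step S2).
STOP_WORDS = {"a", "the", "but", "and", "can", "it"}


def weigh_codes(text):
    """Return the total weighting of each classification code found in the text."""
    totals = defaultdict(float)
    for word in text.lower().split():
        term = word.strip(".,;:!?\"'")
        if term in STOP_WORDS or term not in VOCABULARY:
            continue
        codes = VOCABULARY[term]
        # Each term carries a total weighting of one, shared as 1/n between its codes.
        for code in codes:
            totals[code] += 1.0 / len(codes)
    return dict(totals)


def predominant_code(text):
    """Return the code with the highest total weighting, or None if nothing matched."""
    totals = weigh_codes(text)
    return max(totals, key=totals.get) if totals else None
```

For a sentence such as “The depression brought unemployment and inflation.”, the ambiguous term contributes one third to each of its three codes while the two unambiguous terms each contribute one to the same code, so that code predominates.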

The description above with reference to FIG. 4 assumes that each text document will be allocated to a single category. Generally, however, text documents may be classifiable in more than one subject matter area and more than one genus. Accordingly, instead of identifying the classification codes of the classified terms having the most matches at step S3, the processor 1 may identify each classification code having greater than a predetermined percentage of matches according to the weighting and may then determine at step S4 one or more classification codes which relate to the document, thereby linking the document to each of the relevant classification scheme data set items.
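Allocation to more than one category may be sketched as a threshold on each code's share of the total weighting. The function name and the 25% threshold below are assumptions chosen for illustration; the description does not specify a particular percentage.

```python
def codes_above_threshold(totals, min_share=0.25):
    """Return every code whose share of the total weighting exceeds min_share.

    totals maps a classification code to its total weighting for the document.
    """
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return sorted(
        code for code, weight in totals.items() if weight / grand_total > min_share
    )


# Hypothetical totals for one document.
example_totals = {"ECON": 5.0, "HEALTH": 3.0, "WEATHER": 2.0}
selected = codes_above_threshold(example_totals)
```

With these hypothetical totals the document would be linked to two classification scheme data set items, the third code falling below the threshold.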

The automatic classification software may also provide a user with a mechanism for overriding or modifying an automatically allocated classification code. For example, the instructions supplied to the processor may cause a user to be alerted via the display 7 if the processor 1 has been unable to allocate a classification code or codes to the text document, so allowing the user to classify such documents manually.

FIGS. 5 to 10 illustrate another example of the use of the database described above. In this example, the computing system shown in FIG. 1 is configured to conduct a search via the world wide web. This is achieved by connection to the Internet via the MODEM 10 and the use of a conventional world wide web browser such as Netscape or Microsoft Internet Explorer.

Initially, when a user wishes to search for documents relating to a particular topic, the user activates one of the search engines available on the world wide web causing a user interface similar to that shown in FIG. 5 to be displayed on the display 7 where the box 40 illustrates diagrammatically where the logo and other information relating to the selected search engine would be displayed.

Once the user interface has been displayed, the user is prompted to enter the required search keyword in box 41 and then to instigate the search by, for example, positioning the cursor using the mouse or other pointing device over the phrase ‘Search Now!’ and then clicking.

Once the user has initiated the search, the search engine carries out the search in conventional manner. However, when the search engine returns the results of the search, the processor 1 intercepts and stores these before displaying them to the user and reads the search keyword input by the user (step S6 in FIG. 10). Although not shown in the figures, at this stage the processor 1 may inform the user via the display 7 that the search results have been received and give the user the option of continuing on-line or storing the results of the search so as to minimise on-line time and thus charges.

The processor 1 then checks the classified vocabulary of the database for matches to the keyword used to initiate the search (step S7). Where matches in different categories (which may or may not be genus specific) are identified, the processor 1 reads the description from the classified vocabulary for each term and displays it to the user with a request for the user to select the category required (step S8). FIG. 6 illustrates an example of this user interface. As shown in FIG. 6, the keyword entered by the user was ‘AA’ and three defined subject matter domains were identified: health, roads and weapons. In addition to these, the processor 1 causes the display 7 to give the user the option of selecting the domain ‘other’, that is an undefined domain which is none of the identified domains.

The user interface prompts the user to enter the desired domain in box 42 in FIG. 6 or, if he is unsure of the desired domain, to click on the domain name for a definition. If a definition is requested (step S9) the processor then displays the selected definition on display 7 (step S10). FIGS. 7, 8 and 9 show, respectively, the subsequent screens which would be displayed if the user clicked on health, roads or weapons. As will be appreciated, each of these displays shows the definition stored in the classified vocabulary for the term in that domain.

If the user enters the required domain in FIG. 6 by typing in health, roads, weapons or other, or selects the domain from the definition screen of FIG. 7, 8 or 9 by clicking on the words ‘Select Domain’ (that is, the answer at step S11 is yes), then the processor 1 calls up the collocation of the classification scheme data set item for the selected domain and searches at step S12 for the use of terms listed in the collocation in the documents forming the search results.

The processor then determines at step S13 which of the search results documents have at least a predetermined number of matches with the collocation terms and then displays to the user at step S14 only those search results documents having at least the predetermined number of collocation terms. If the domain ‘other’ is selected, the processor lists those documents not containing (or containing the least number of) terms used in the collocations associated with the other three domains. The processor may order the search results in accordance with the number of matches with the collocation terms of the selected domain. It may list all of the search results in that order, with the result having the highest number of matches being listed first, or may display only a given number of the search results, for example the first ten or twenty, to the user.
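Steps S12 to S14 may be sketched as follows. The collocation contents, the function names and the default threshold of two matches are hypothetical; in this sketch a match is counted once per distinct collocation term present in a document, which is one possible reading of the matching described above. Handling of the domain ‘other’ is not shown.

```python
import re

# Hypothetical collocations for the three domains identified for the keyword 'AA'.
COLLOCATIONS = {
    "health": {"alcohol", "sobriety", "recovery", "meeting"},
    "roads": {"breakdown", "motorist", "car", "route"},
    "weapons": {"gun", "aircraft", "artillery", "battery"},
}


def count_matches(document, collocation):
    """Count the distinct collocation terms that occur in the document (step S12)."""
    words = set(re.findall(r"[a-z]+", document.lower()))
    return len(words & collocation)


def refine_results(documents, domain, min_matches=2, limit=None):
    """Keep documents with at least min_matches collocation terms, best first."""
    collocation = COLLOCATIONS[domain]
    scored = [(count_matches(doc, collocation), doc) for doc in documents]
    kept = [(n, doc) for n, doc in scored if n >= min_matches]   # step S13
    kept.sort(key=lambda pair: pair[0], reverse=True)            # order by matches
    ranked = [doc for _, doc in kept]
    return ranked[:limit] if limit is not None else ranked       # step S14


results = [
    "AA breakdown cover for the motorist and the car",
    "AA recovery meeting",
    "AA meeting times: support for recovery from alcohol dependence",
]
refined = refine_results(results, "health")
```

Here the roads document is filtered out and the two remaining documents are listed with the one containing three collocation terms ahead of the one containing two.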

By using the collocations, the processor 1 can disambiguate different meanings of the same word and the search results produced by the search engine can be refined so as to select only those documents which use terms relevant to or which would be used in discussing or describing the keyword in the subject matter area or domain selected by the user. Thus, the search results relating to the use of the term ‘AA’ in subject matter areas different from the one selected by the user can be filtered out so that, for example, if the user selects the domain: ‘AA:HEALTH’, he will be provided with only the documents relating to Alcoholics Anonymous and not documents relating to the Automobile Association or anti-aircraft weapons.

A further application of the database will now be described with reference to FIGS. 11 and 12.

Commonly used software applications such as word processors, databases and spreadsheets need to be able to validate words. However, current spelling checkers are extremely limited in their application. For example, most current spelling checkers cannot identify place names, product names, company names and the names of people, particularly surnames, where these words are not also common nouns.

The spelling checkers of such word processors, databases and spreadsheets may, however, be modified using the apparatus described above and the classification scheme data set to enable far more accurate verification of text.

In this example, the dictionary of a conventional spelling checker is replaced by the database described above. When instructed to verify the text, the processor first reads the document at step S20, compares the terms used in the document with the classified vocabulary of the database at step S21, identifies at step S22 any terms not in the vocabulary, then matches at step S23 the document terms against the terms in the classified vocabulary so as to determine at step S24 the domain having the most matches and thus the subject matter area and the classification code of the document. This is carried out in a similar manner to the automatic document classification discussed above with reference to FIG. 4. Steps S21 and S22 may be carried out after steps S23 and S24.

Once the subject matter area of the document has been determined, the processor 1 at step S25 checks for terms in the classified vocabulary which have the same classification code as that allocated to the document and are closest in spelling to the unknown term and displays these to the user at step S26. This enables the selection of the possible alternatives for the unknown word or term to be specifically directed toward the subject matter of the document being checked so that inappropriate alternatives are not presented.
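Step S25 may be sketched as follows. The description does not say how closeness in spelling is measured; this sketch substitutes the similarity ratio of Python's standard `difflib` module as one possible measure. The vocabulary, the classification code labels and the function name are hypothetical.

```python
import difflib

# Hypothetical classified vocabulary: term -> classification code.
VOCABULARY = {
    "paracetamol": "PHARMA",
    "penicillin": "PHARMA",
    "parliament": "POLITICS",
    "parameter": "MATHS",
}


def suggest(unknown_term, document_code, n=3, cutoff=0.6):
    """Suggest vocabulary terms sharing the document's code, closest spelling first."""
    # Restrict the candidates to the subject matter area allocated to the document.
    candidates = [
        term for term, code in VOCABULARY.items() if code == document_code
    ]
    return difflib.get_close_matches(unknown_term, candidates, n=n, cutoff=cutoff)


suggestions = suggest("paracetemol", "PHARMA")
```

Because the candidates are limited to terms carrying the document's classification code, similarly spelt terms from unrelated domains are never offered to the user.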

FIG. 12 shows a flowchart illustrating a modification of the process described with reference to FIG. 11. In the modification shown in FIG. 12, after the processor 1 has identified any terms not in the classified vocabulary at step S22, the processor 1 identifies at step S27 the closest terms or most likely terms in the vocabulary regardless of their classification code, that is regardless of their subject matter area or domain, and then displays these closest terms to the user at step S28 via the user interface. At this time, as indicated by step S29, the processor also requests the user, via the user interface, to select whether or not context specific identification of possible closest terms is required. If the answer is no, then the spell checking is terminated at step S30. If, however, the answer is yes, then the processor proceeds to steps S24 to S26 as discussed with reference to FIG. 11. This enables the user to select whether or not context or subject matter specific selection of possible alternatives for the unknown word is required.

The above description suggests that a single general database consisting of the classified vocabulary and the accompanying classification scheme data set will be provided. This need not, however, be the case. Rather, the contents of the database provided may be specific to the requirements of the user with, for example, a particular user perhaps only being provided with the classified vocabulary for a specific subject matter area or areas and the associated classification scheme data set item or items. Additionally, the general database or a specific such database may be supplemented by additional classified terms specific to a particular user's requirements. Thus, individual lists of specialist classified terms may be prepared and supplied together with related items of the classification scheme data set. Examples of such specialist classified vocabulary lists are lists of pharmaceutical compound names and chemical names for the pharmaceutical industry and specialist lists of persons involved in a specific field, for example a list of all recognised chemists in a particular field or all recognised scientists such as Einstein, Oppenheimer, Newton etc.

Such classified lists may provide a key to standardised data and therefore greatly improve retrieval of data from a database. At present, some companies may have their own internal standards or authority files to ensure that employees are using the same terminology but with the growing use of the Internet and intranets there is a fast growing need for standard data that can be used by all organisations around the world. Classified lists provide a powerful way of establishing standard specialist vocabularies. Such specialist vocabulary classified lists may be used, for example, to supplement word processing spell checkers such as those described above with reference to FIGS. 11 and 12. For example, the pharmaceutical industry may be provided with one or more classified lists listing the chemical and trade names of pharmaceuticals and related terminology. Other classified lists may include specialist lists of persons recognised in a particular field, for example recognised physicists or chemists, or a classified list which enables different language versions of the same name to be identified (for example Vienna and Wien), for example to facilitate postal services.

The apparatus described above may also be used to index documents. Thus, for example, where specialist classified lists are provided, then documents in the field of the specialist classified list may be indexed in accordance with that list. For example, the processor 1 may index documents in the field of chemistry in accordance with the names of recognised chemists appearing in those documents by comparing the terms used in the documents with specialist classified lists accessible to the processor 1 and then indexing each document under each term in the specialist classified list identified in the document. This would enable, for example, a researcher to identify all papers published by a specific person identified in the classified list or to extract all documents referring to each of a number of persons identified in the classified list.
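Indexing against a specialist classified list may be sketched as follows. The document identifiers, the document texts and the function name are hypothetical, and whole-word, case-insensitive matching is an assumption of this sketch.

```python
import re
from collections import defaultdict

# Hypothetical specialist classified list of recognised scientists.
SCIENTISTS = ["Einstein", "Oppenheimer", "Newton"]


def build_index(documents, classified_list):
    """Index each document under every listed term that appears in it.

    documents maps a document identifier to its text.
    """
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for term in classified_list:
            # Whole-word match so that 'Newton' does not match 'Newtonian'.
            if re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
                index[term].append(doc_id)
    return dict(index)


papers = {
    "doc1": "Einstein and Newton on gravity",
    "doc2": "Oppenheimer's lectures",
    "doc3": "Newtonian mechanics",
}
index = build_index(papers, SCIENTISTS)
```

A researcher could then retrieve every document referring to a given person by looking up that person's name in the resulting index.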

As noted above, because the database is classified both as to subject matter and as to genus, it enables the processor 1 to validate words including proper nouns which are stored in the classified vocabulary, to differentiate between semantic items, for example the use of the word ‘wood’ as a surname or as a material, to identify the use of common terms as also being names of products, to provide via the classified lists variants on forms or spellings of names such as Vienna/Wien and to provide, again via the classified lists, lists of specialist terms, for example all chemical compounds, all mathematicians, all units of currency, as required by the end user. Moreover, because the classification scheme is modular, an end user may be supplied with only a part of the classified vocabulary specific to his particular needs with the associated classification scheme data set items without having to make any modifications to the classified vocabulary. Furthermore, the subject matter areas or domains can easily be refined by the addition of deeper and deeper levels of subsidiary domains without disturbing the overall structure of the database.

The classified vocabulary or items of data may be provided in different languages. Different classification scheme data sets will however be required for different languages because there is not always a direct correlation in meaning. The apparatus described above may be used to assist in translation of documents. In order to achieve this, the apparatus is given access to two different language versions of the database and to an electronically stored conventional dictionary providing translations of the source language into the required final language. In order to assist in the translation of the document, the apparatus first determines, in a manner similar to that described above with reference to FIG. 4, the category within which the source language document falls by comparing the terms used in the source language document against the source language classified vocabulary. Once the category of the document has been determined, the processor then looks up the translation of each word in the document using the electronic dictionary and, where a number of alternative translations exist, looks up the translations in the final language database and selects as the translation the term having the same category as the source term. Of course, the apparatus will generally not be used to provide an automatic translation of a document but simply to provide the user of the apparatus with a translation of the term which is specific to the context of the document to assist the user in preparing a more accurate translation. As another possibility, a first database consisting of a vocabulary of terms in one language and an associated classification scheme data set in that language may be associated with a second database consisting of a vocabulary of terms in a second language with the terms in the second vocabulary being associated with the same collocations as the first database. An apparatus provided with such databases would then be able to, at the request of a user, provide the user with a translation of a term in the document by determining the collocation associated with that term and then determining which possible translation of the term is associated with the same collocation. Such an arrangement could be associated with the above-mentioned classified list to provide or improve a foreign language dictionary.
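Selection of a context specific translation may be sketched as follows. The English to German dictionary entries, the category labels and the function name are hypothetical, and the sketch assumes that the same category labels are usable in both language versions of the database.

```python
# Hypothetical electronic dictionary: source term -> alternative translations.
DICTIONARY = {
    "bank": ["Bank", "Ufer"],
    "river": ["Fluss"],
}

# Hypothetical final language classified vocabulary: term -> categories.
TARGET_VOCABULARY = {
    "Bank": {"FINANCE"},
    "Ufer": {"GEOGRAPHY"},
    "Fluss": {"GEOGRAPHY"},
}


def translate(term, document_category):
    """Return the translation whose category matches that of the document."""
    candidates = DICTIONARY.get(term, [])
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]  # no alternatives, so nothing to disambiguate
    for candidate in candidates:
        if document_category in TARGET_VOCABULARY.get(candidate, set()):
            return candidate
    return None  # no candidate shares the category: left to the user


geographic = translate("bank", "GEOGRAPHY")
financial = translate("bank", "FINANCE")
```

Consistent with the description, a result of this kind would be offered to the user as an aid to translation rather than applied automatically.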

As noted above, as used herein the term ‘collocation’ means a collection or list of terms which exemplify the domain or category with which the collocation is associated. However, the collocations may be ranked so that the terms within each collocation are arranged in order of significance. For example, the terms used in the collocation may be split into a number of groups of terms with the groups of terms being ordered in accordance with their significance to the domain with which they are associated. This would enable, where necessary or desired, limited numbers of the groups of terms to be used by the computing system. Limiting the number of terms in the collocation which are actually used in practice to those of most significance in relation to the subject matter area should enable the computing system to carry out the processes described above, for example searching, classification or spell checking, more rapidly with only a slight degradation in accuracy.
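A ranked collocation may be sketched as an ordered list of groups of terms, most significant first, from which only the leading groups are used. The groups shown and the function name are hypothetical.

```python
# Hypothetical ranked collocation: groups of terms, most significant group first.
RANKED_HEALTH_COLLOCATION = [
    {"alcohol", "sobriety"},
    {"recovery", "meeting"},
    {"support", "sponsor"},
]


def active_terms(ranked_groups, max_groups):
    """Return the terms of the max_groups most significant groups only."""
    terms = set()
    for group in ranked_groups[:max_groups]:
        terms.update(group)
    return terms


limited = active_terms(RANKED_HEALTH_COLLOCATION, max_groups=2)
```

Matching against the reduced set involves fewer comparisons per document, which is the source of the speed gain, at the cost of ignoring the less significant terms.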

The classification scheme discussed above with reference to FIG. 2 may be associated with existing classification schemes. Thus, for example, a link may be provided between a particular subsidiary subject matter area or domain and an existing specialist classification scheme for that area. For example, a subsidiary subject matter area or domain directed toward patents may be linked to the international patent classification system and the subsidiary subject matter area relating to living organisms may, for example, be linked to the Whittacker system to enable advantage to be taken of the specialist information in those classification systems.

Although in the arrangements described above, each specific category is associated with a particular classification scheme data set item and thus with a specific collocation, items of data of different genus but falling within the same subject matter area or domain may share a collocation because frequently the same terms will be used in relation to items of data falling within different genera in the same subject matter area.

In the arrangement described above with reference to FIGS. 4, 11 and 12, the classified vocabulary is used to determine the category of a document. As another possibility, the terms used in a document to be classified may be compared against the collocations. This requires, however, that the text document be compared against each collocation in turn and then the collocation having the greatest number of matches be identified to determine the predominant category for the document. This approach relies on a fixed body of data and, because each collocation is specific to a category and each collocation has to be tested in turn, tends to be less accurate and takes longer to classify the document. In contrast, using the classified vocabulary which encompasses all subject matter areas of the database (possibly minus any extremely common or frequently used words such as “it”, “an”, “a”, “and”, “but”, “can”, “do” and so on) provides for greater flexibility and moreover results in quicker and more accurate classification of the document. It is preferred that the classified vocabulary be used for the document classification and the collocations be used for disambiguation such as in the case of the example described above with reference to FIGS. 5 to 10.

In the above examples, the classified vocabulary consists of classified terms. Conceivably, however, the classified items of data may be images, music or other sounds or non-textual matter. Of course, manual classification will be necessary if the items of data are not accompanied by related text.

It will be appreciated that the processor implementable instructions for causing the processor 1 to carry out any of the operations described above may be supplied via a storage medium insertable into a removable disk disk drive as discussed above. Alternatively, or additionally, the computer or processor implementable instructions can be supplied as a signal by, for example, downloading the code over a network which may be an intranet or the Internet. An aspect of the present invention thus provides a storage medium storing processor implementable instructions for controlling the processor to carry out one or more of the processes described above. Another aspect of the present invention provides an electrical signal carrying processor implementable instructions for controlling the processor to carry out one or more of the methods described above.

As noted above, the database for use by the apparatus may be supplied on a storage medium insertable into one of the removable disk disk drives or may be accessed remotely as a signal downloaded over a network such as the Internet or an intranet. Also, the classification scheme data set may be supplied separately from the classified vocabulary or items of data. The present invention thus also provides a storage medium storing a classified vocabulary or items of data and/or the classification scheme data set or items therefrom as discussed above. The present invention also provides an electrical signal carrying a classified vocabulary and/or all or some of the items from the classification scheme data set as discussed above.

In one aspect, the present invention provides apparatus for storing data on a computer readable storage medium, comprising:

means for storing items of data;

means for associating each item of data with one of a number of different categories of data;

means for associating all items of data falling within the same category with a common code identifying a collocation of terms that exemplify that category so that items of data in different categories are associated with different codes identifying different collocations of terms with each collocation of terms being specific to the associated category; and

means for directly or indirectly writing each item of data together with the associated code onto a computer readable storage medium.

In one aspect, the present invention provides apparatus for storing data on a computer readable storage medium, comprising:

means for storing items of data;

means for storing a plurality of different collocations of terms with the terms in each different collocation being terms that exemplify a specific different one of a plurality of categories of data;

means for associating each item of data with one of said number of different categories of data;

means for associating all items of data falling within the same category with a common code identifying which one of said collocations contains terms that exemplify items of data in that category so that items of data in different categories are associated with different codes identifying different ones of said collocations of terms; and

means for directly or indirectly storing the plurality of collocations and each item of data together with its associated code onto a computer readable storage medium.

In one aspect, the present invention provides apparatus for storing data on a computer readable storage medium, comprising:

means for storing items of data;

means for associating each item of data with one of a number of different species of data and one of a number of different subject matter areas such that the associated species and subject matter area define a category for that item of data;

means for associating all items of data falling within the same category with a common code identifying a collocation of terms that exemplify that category so that items of data in different categories are associated with different codes identifying different collocations of terms with each collocation of terms being specific to the associated category; and

means for directly or indirectly writing each item of data together with the associated code onto a computer readable storage medium.

In one aspect, the present invention provides apparatus for storing data on a computer readable storage medium, comprising:

means for storing items of data;

means for storing a plurality of different collocations of terms with the terms in each different collocation being terms that exemplify items of data falling within a specific different combination of one of a number of different species of data and one of a number of different subject matter areas such that the associated species and subject matter area define a category for that item of data;

means for associating each item of data with a category;

means for associating all items of data falling within the same category with a common code identifying which one of said collocations contains terms exemplifying items of data in that category so that items of data in different categories are associated with different codes identifying different ones of said collocations of terms; and

means for directly or indirectly storing the plurality of collocations and each item of data together with its associated code onto a computer readable storage medium.

In one aspect, the present invention provides apparatus for processing computer usable data, comprising:

means for storing items of data;

means for associating each item of data with one of a number of different categories of data;

means for associating all items of data falling within the same category with a common code identifying a collocation of terms usable in relation to items of data in that category so that items of data in different categories are associated with different codes identifying different collocations of terms with each collocation of terms being specific to the associated category; and

means for generating a signal carrying each item of data together with its associated code for supply to a computer readable storage medium.

In one aspect, the present invention provides apparatus for processing computer usable data, comprising:

means for storing items of data;

means for storing a plurality of different collocations of terms with the terms in each different collocation exemplifying items of data falling within a specific different one of a plurality of categories of data;

means for associating each item of data with one of said number of different categories of data;

means for associating all items of data falling within the same category with a common code identifying which one of said collocations contains terms exemplifying items of data in that category so that items of data in different categories are associated with different codes identifying different ones of said collocations of terms; and

means for generating a signal carrying each item of data together with its associated code for supply to a computer readable storage medium.

In one aspect, the present invention provides apparatus for processing computer usable data, comprising:

means for storing items of data;

means for associating each item of data with one of a number of different species of data and one of a number of different subject matter areas such that the associated species and subject matter area define a category for that item of data;

means for associating all items of data falling within the same category with a common code identifying a collocation of terms usable in relation to items of data in that category so that items of data in different categories are associated with different codes identifying different collocations of terms with each collocation of terms being specific to the associated category; and

means for generating a signal carrying each item of data together with its associated code for supply to a computer readable storage medium.
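As a minimal sketch of how a species and a subject matter area might jointly define a category, the pair can be used as a lookup key that yields one code per distinct combination. The function name `category_code`, the five-digit code format, and the example species and subject area labels are hypothetical and chosen only for illustration.

```python
def category_code(species, subject_area, registry):
    """Return the code for the (species, subject area) pair, creating it if new.

    `registry` maps each pair to its code; a distinct pair always gets a
    distinct code, and a repeated pair always gets the same code.
    """
    key = (species, subject_area)
    if key not in registry:
        # Assumed code format: zero-padded running number.
        registry[key] = f"{len(registry) + 1:05d}"
    return registry[key]


registry = {}
first = category_code("noun", "POWGEN", registry)
again = category_code("noun", "POWGEN", registry)
other = category_code("noun", "OPTGEN", registry)
```

Here `first` and `again` are the same code, while `other` differs because the subject matter area differs.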

In one aspect, the present invention provides apparatus for storing data on a computer readable storage medium, comprising:

means for storing items of data;

means for storing a plurality of different collocations of terms with the terms in each different collocation exemplifying items of data falling within a specific different combination of one of a number of different species of data and one of a number of different subject matter areas such that the associated species and subject matter area define a category for that item of data;

means for associating each item of data with a category;

means for associating all items of data falling within the same category with a common code identifying which one of said collocations contains terms usable in relation to items of data in that category so that items of data in different categories are associated with different codes identifying different ones of said collocations of terms; and

means for generating a signal carrying each item of data together with its associated code for supply to a computer readable storage medium.

In one aspect, the present invention provides a computer usable medium having computer readable instructions stored therein for causing the computer:

to associate each of a plurality of items of data with one of a number of different categories;

to associate all the items of data falling within the same category with a common code identifying a collocation of terms exemplifying items of data in that category so that items of data in different categories are associated with different codes identifying different collocations of terms with each collocation of terms being specific to the associated category; and

to generate a signal carrying each item of data together with its associated code for supply to a computer readable storage medium.

In one aspect, the present invention provides a computer usable medium having computer readable instructions stored therein for causing the computer:

to associate each of a plurality of items of data with one of a number of different species of data and one of a number of different subject matter areas such that the associated species and subject matter area define a category for that item of data;

to associate all items of data falling within the same category with a common code identifying a collocation of terms exemplifying items of data in that category so that items of data in different categories are associated with different codes identifying different collocations of terms with each collocation of terms being specific to the associated category; and

to generate a signal carrying each item of data together with its associated code for supply to a computer readable storage medium.

In one aspect, the present invention provides a computer usable medium having computer readable instructions stored therein for causing the computer:

to associate each of a plurality of items of data with one of a number of different categories of data;

to associate all items of data falling within the same category with a common code identifying a collocation of terms exemplifying items of data in that category so that items of data in different categories are associated with different codes identifying different collocations of terms with each collocation being specific to the associated category; and

directly or indirectly to write each item of data together with the associated code onto a computer readable storage medium.

In one aspect, the present invention provides a computer usable medium having computer readable instructions stored therein for causing the computer:

to associate each of a plurality of items of data with one of a number of different species of data and one of a number of different subject matter areas such that the associated species and subject matter area define a category for that item of data;

to associate all items of data falling within the same category with a common code identifying a collocation of terms exemplifying that category so that items of data in different categories are associated with different codes identifying different collocations of terms with each collocation of terms being specific to the associated category; and

directly or indirectly to write each item of data together with the associated code onto a computer readable storage medium.

In one aspect, the present invention provides apparatus for processing data comprising:

means for accessing from storage means a plurality of collocations of terms with each collocation being associated with a different category of data and containing terms exemplifying that category;

means for receiving items of data;

means for determining a collocation which is relevant to a received item of data; and

means for processing the received item of data using terms from that collocation.

In one aspect, the present invention provides apparatus for checking the spelling of terms in a text, comprising:

means for receiving the text to be checked;

means for accessing first storage means storing a plurality of different collocations of terms with the terms in each collocation being usable in relation to a particular different category;

means for accessing second storage means storing a vocabulary with each term in the vocabulary being associated with a respective code identifying a specific one of said different collocations and a specific category for each different context or meaning of the term;

means for comparing the terms used in the text with the terms in the vocabulary to identify any terms in the text not present in the vocabulary;

means for, when unknown terms not present in the vocabulary are identified, comparing the rest of the terms in the text with the terms in the collocations to determine the collocation which has terms most closely matching the terms of the text to determine the category to which the text should be allocated;

means for determining any term in the vocabulary associated with the determined category for which the unknown term may be a misspelling; and

means for advising a user of the determined term(s).
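The spelling-check steps listed above might be sketched as follows. This is an illustrative sketch only: the sample vocabulary and collocations are hypothetical, and the use of `difflib.get_close_matches` from the Python standard library as the test for "may be a misspelling" is an assumption, since the aspect does not prescribe a particular similarity measure.

```python
import difflib

# Hypothetical sample data: each vocabulary term maps to the codes of the
# categories (one per context or meaning) in which it is used.
VOCABULARY = {
    "bayonet": {"00010", "00020"},
    "filament": {"00010"},
    "fuse": {"00010"},
    "bulb": {"00010"},
    "aperture": {"00020"},
    "lens": {"00020"},
    "camera": {"00020"},
}
COLLOCATIONS = {
    "00010": {"bayonet", "filament", "fuse", "bulb"},
    "00020": {"bayonet", "aperture", "lens", "camera"},
}


def suggest(text_terms):
    """Map each unknown term to vocabulary terms of the text's category."""
    unknown = [t for t in text_terms if t not in VOCABULARY]
    if not unknown:
        return {}
    # Use the remaining (known) terms to pick the best-matching collocation.
    known = set(text_terms) - set(unknown)
    category = max(COLLOCATIONS, key=lambda c: len(known & COLLOCATIONS[c]))
    # Only terms classified in the determined category are offered.
    candidates = sorted(t for t, codes in VOCABULARY.items() if category in codes)
    return {
        u: difflib.get_close_matches(u, candidates, n=3, cutoff=0.6)
        for u in unknown
    }
```

Restricting the candidates to the determined category means that a misspelling in a photography text is corrected against photography terms rather than against the whole vocabulary. Where two collocations tie, this sketch simply takes the first.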

In one aspect, the present invention provides apparatus for classifying a text into one of a number of different subject matter categories, comprising:

means for receiving the text to be classified;

means for accessing storage means storing a plurality of different collocations of terms with the terms in each collocation being usable in relation to a particular subject matter category and each collocation being associated with a classification code identifying the particular subject matter category to which the collocation is relevant;

means for comparing terms used in the text with the terms in the collocations;

means for determining which of the collocations has the most terms in common with the text being classified; and

means for allocating to the text the classification code associated with the determined collocation.
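A minimal sketch of the classification steps above is given below. The tokenisation by a simple regular expression (which handles single-word terms only), the sample collocations, and the choice to return `None` when no collocation shares any term with the text are assumptions made for illustration.

```python
import re

# Hypothetical sample collocations keyed by classification code.
SAMPLE_COLLOCATIONS = {
    "00010": {"bulb", "filament", "fuse", "battery"},
    "00020": {"camera", "lens", "aperture", "film"},
}


def classify(text, collocations):
    """Return the code of the collocation sharing the most terms with the text."""
    terms = set(re.findall(r"[a-z]+", text.lower()))
    best_code, best_score = None, 0
    for code, collocation in collocations.items():
        score = len(terms & collocation)  # number of terms in common
        if score > best_score:
            best_code, best_score = code, score
    return best_code


code = classify("The fuse blew and the bulb filament broke.", SAMPLE_COLLOCATIONS)
```

In this example the text shares three terms with the first collocation and none with the second, so `code` is the first collocation's classification code.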

In one aspect, the present invention provides apparatus for refining the results of a subject matter search carried out by a search engine using a keyword, comprising:

means for accessing first storage means storing a plurality of different collocations of terms with the terms in each collocation exemplifying a particular different subject matter category;

means for accessing second storage means storing a vocabulary with each term in the vocabulary being associated with a respective code identifying a specific one of said different collocations and a specific category for each different context or meaning of the term;

means for receiving the results of the subject matter search;

means for comparing the keyword used to carry out the search with the terms in the vocabulary to determine each category with which the keyword is associated;

means for advising a user of the different categories with which the keyword is associated;

user operable selection means for selecting one of the categories with which the keyword is associated;

means for comparing the terms used in text in each of the search results with the collocation of terms of the selected category; and

means for advising the user of the search results for which the text has greater than a predetermined number of terms in common with the collocation for the selected category.
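The refinement of search results described above might be sketched as follows. The function names `categories_for` and `refine`, the representation of a search result as a (title, text) pair, the whitespace tokenisation, and the sample data are assumptions for this sketch only.

```python
def categories_for(keyword, vocabulary):
    """List the category codes with which the keyword is associated."""
    return sorted(vocabulary.get(keyword, ()))


def refine(results, collocation, min_common):
    """Keep results whose text shares more than `min_common` terms with the
    collocation of the category selected by the user."""
    kept = []
    for title, text in results:
        terms = set(text.lower().split())
        if len(terms & collocation) > min_common:
            kept.append(title)
    return kept


# Hypothetical sample data.
vocabulary = {"bayonet": {"00010", "00020"}}
photography = {"bayonet", "lens", "camera", "aperture"}
results = [
    ("A", "bayonet lens mount for a camera"),
    ("B", "bayonet bulb with a broken filament"),
]
choices = categories_for("bayonet", vocabulary)  # offered to the user
refined = refine(results, photography, min_common=1)
```

Result "B" contains the keyword but shares only that one term with the selected collocation, so it is dropped when the predetermined number is 1.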

In one aspect, the present invention provides apparatus for checking the usage of terms in a text, comprising:

means for receiving the text to be checked;

means for accessing first storage means storing a classified vocabulary in which the terms are allocated to categories;

means for comparing terms in the text with the terms in the classified vocabulary to determine a category for the text;

means for identifying any terms not in the classified vocabulary; and

means for advising the user of any term in the classified vocabulary similar to an unidentified term and having the determined category.

Other modifications will be apparent to those skilled in the art.

Appendix A: Data Samples

1. Classified Vocabulary

-   TERM bayonet
-   DESCRIPTION technology
-   DEFINITION type of fitting for a light bulb in which prongs on its side fit into slots to hold it in place
-   CAT ID 00010
-   TERM bayonet
-   DESCRIPTION Photography
-   DEFINITION type of fitting for a camera lens in which prongs on its side fit into slots to hold it in place
-   CAT ID 00020
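The two sample entries above show one term classified under two different category identifiers. A minimal sketch of a data structure holding such entries is given below; the `Sense` type, the index layout, and the abbreviated definition strings are assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Sense(NamedTuple):
    term: str
    description: str
    definition: str  # abbreviated here; not the full sample text
    cat_id: str


SENSES = [
    Sense("bayonet", "technology", "fitting for a light bulb", "00010"),
    Sense("bayonet", "Photography", "fitting for a camera lens", "00020"),
]

# Index every sense by its term so that all meanings of a term can be listed.
index = defaultdict(list)
for sense in SENSES:
    index[sense.term].append(sense)

cat_ids = [s.cat_id for s in index["bayonet"]]
```

Looking up "bayonet" yields both senses, each carrying the category identifier that disambiguates it.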

2. Classification Scheme

-   CAT ID=00010
-   DOMAIN MI SUBDOMAIN TEC SUBDOMAIN POW SUBDOMAIN POWGEN COLLOCATIONS; A; AF; AGR; CAD; Calor gas; EP; P; acceptor; accident; accumulator; acoustic coupler; actuator; adapter; adaptor; advanced gas-cooled reactor; afterdamp; alternating current; alternator; ambisonics; ammeter; amp; amplification; amplifier; analogue-to-digital converter; anode; anthracite; antinuclear; armature; audio; audiometer; bank; barrel; battery; bayonet; bell; bezel; binaural; biological shield; bipolar; bipolarity; blackout; bleep; blip; bloop; blow-out; blow; boiler; booster; bore; borehole; bowser; brakeman; brakesman; brazier; breadboard; break; breed; breeder reactor; bridge; briquet; briquette; bromine; brush; bulb; bunker; burn-up; butane; button cell battery; button cell; buzzer; bypass; cable; cage; candle; capacitor; capstan; ceramic stratus; chemical laser; codec; coder/decoder; cut-out; cut; damp; damper; deck; derrick; diaphragm; diesel; diffuser; disc; discharge; dross; earth; electro; element; envelope; excitant; exciter; excitor; fantail; feedback; feeder; fender; fidelity; filament; filter; fireman; flasher; flashlight; flip side; flip-flop; fuel; fuse; gain; gap; gas; gate; geyser; kieselguhr; oiler; outage; paraffin . . .
-   <CAT ID=00020><BRANCH><DOM>MI<SUBDOM>TEC<SUBDOM>OPT<SUBDOM>OPTGEN</BRANCH><COLLS>; Betacam; Betamax; Brownie; Calotype; Overcoat; PAL; aberration; achromat; achromatic; adaptive optics; aliasing; amplifier; anaglyph; anamorphic lens; aperture synthesis; aperture; apochromat; aspect ratio; atomic force microscope; autofocus; automatic exposure; autotype; b/w; back projection; bath; bayonet; bellows; bifocal; binocular; black and white; blimp; blow-up; blue-backing shot; box camera; bromide paper; bromine; bromoil; bull's-eye; camcorder; camera lucida; camera obscura; camera; carbro; color cinematography; color negative; colorization; colour cinematography; colour negative; conforming; coronagraph; couplers; daguerreotype; develop; developer; diaphragm; dolly; emulsion; exposure; film; filter; fix; fixer; flash; flashlight; flood; fog; frame; freezeframe; gauge; ghost; meniscus; microdot; mil; monitor; mount; negative; nosepiece; objective; ocular; opaque; pan . . . .
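For illustration, a record in the tagged form of the second sample above might be read into a category identifier, a domain path and a list of collocation terms as sketched below. The parsing rules are inferred from that single sample and are assumptions, as is the function name `parse_scheme`; the record in the example is shortened.

```python
import re


def parse_scheme(record):
    """Split a tagged classification record into (cat_id, path, terms)."""
    cat_id = re.search(r"<CAT\s+ID=(\d+)>", record).group(1)
    branch = re.search(r"<BRANCH>(.*?)</BRANCH>", record, re.S).group(1)
    # The domain and each subdomain follow their own opening tag.
    path = [p.strip() for p in re.split(r"<(?:DOM|SUBDOM)>", branch) if p.strip()]
    # Collocation terms are separated by semicolons after <COLLS>.
    colls = record.split("<COLLS>", 1)[1]
    terms = [t.strip() for t in colls.split(";") if t.strip()]
    return cat_id, path, terms


record = (
    "<CAT ID=00020><BRANCH><DOM>MI<SUBDOM>TEC<SUBDOM>OPT<SUBDOM>OPTGEN"
    "</BRANCH><COLLS>; Betacam; Betamax; aperture; bayonet"
)
cat_id, path, terms = parse_scheme(record)
```

Splitting on semicolons rather than on spaces keeps multi-word terms such as "anamorphic lens" intact.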

1. A computer processing apparatus for classifying a document, comprising: means for accessing a database structure providing a plurality of different subject matter categories, the database containing a classified vocabulary consisting of terms in all of the different subject matter categories with each term being classified in accordance with the subject matter category structure of the database; means for receiving in computer-readable form a text document to be classified; processor means operable to compare terms appearing in the text document with the terms in the classified vocabulary and to determine from the comparison the category for the document; and means for supplying a signal carrying data representing the text document and data associating the text document with the determined category.